ABSTRACT

A method to provide automatic content-based video indexing
from object motion is described. Moving objects in video from
a surveillance camera 11 are detected in the video sequence
using motion segmentation methods by motion segmentor 21.
Objects are tracked through segmented data in an object
tracker 22. A symbolic representation of the video is
generated in the form of an annotated graph describing the
objects and their movement. A motion analyzer 23 analyzes
results of object tracking and annotates the motion graph
with indices describing several events. The graph is then
indexed using a rule-based classification scheme to identify
events of interest such as appearance/disappearance,
deposit/removal, entrance/exit, and motion/rest of objects.
Clips of the video identified by spatio-temporal,
event-based, and object-based queries are recalled to view
the desired video.

22 Claims, 11 Drawing Sheets 
MOTION BASED EVENT DETECTION
SYSTEM AND METHOD

This application claims priority under 35 USC §119(e)(1)
of provisional application No. 60/011,106, filed Feb. 5,
1996. This application is related to co-pending application
Ser. No. 08/795,434 (71-2254$) entitled, "Object Detection
Method and System for Scene Change Analysis in TV and
IR Data" of Jonathan Courtney, et al., filed Feb. 50, 1997.
This application is incorporated herein by reference.

TECHNICAL FIELD OF THE INVENTION 

This invention relates to motion event detection as used 
for example in surveillance. 

BACKGROUND OF THE INVENTION 

Advances in multimedia technology, including commercial
prospects for video-on-demand and digital library systems,
have generated recent interest in content-based video
analysis. Video data offers users of multimedia systems a
wealth of information; however, it is not as readily manipu- 
lated as other data such as text. Raw video data has no 
immediate "handles" by which the multimedia system user 
may analyze its contents. Annotating video data with sym- 
bolic information describing its semantic content facilitates 
analysis beyond simple serial playback. 

Video data poses unique problems for multimedia infor- 
mation systems that text does not. Textual data is a symbolic 
abstraction of the spoken word that is usually generated and 
structured by humans. Video, on the other hand, is a direct 
recording of visual information. In its raw and most common 
form, video data is subject to little human-imposed structure, 
and thus has no immediate "handles" by which the multi- 
media system user may analyze its contents. 

For example, consider an on-line movie screenplay 
(textual data) and a digitized movie (video and audio data). 
If one were analyzing the screenplay and interested in 
searching for instances of the word "horse" in the text, many 
text searching algorithms could be employed to locate every 
instance of this symbol as desired. Such analysis is common 
in on-line text databases. If, however, one were interested in 
searching for every scene in the digitized movie where a 
horse appeared, the task is much more difficult. Unless a 
human performs some sort of pre-processing of the video 
data, there are no symbolic keys on which to search. For a 
computer to assist in the search, it must analyze the semantic 
content of the video data itself. Without such capabilities, 
the information available to the multimedia system user is 
greatly reduced. 

Thus, much research in video analysis focuses on seman- 
tic content-based search and retrieval techniques. The term 
"video indexing" as used herein refers to the process of 
marking important frames or objects in the video data for 
efficient playback. An indexed video sequence allows a user 
not only to play the sequence in the usual serial fashion, but 
also to "jump" to points of interest while it plays. A common 
indexing scheme is to employ scene cut detection to determine
breakpoints in the video data. See H. Zhang, A. Kankanhalli,
and Stephen W. Smoliar, Automatic Partitioning of Full
Motion Video, Multimedia Systems, 1, 10-28 (1993). Indexing
has also been performed based on camera (i.e., viewpoint)
motion, see A. Akutsu, Y. Tonomura, H. Hashimoto, and Y.
Ohba, Video Indexing Using Motion Vectors, in Petros
Maragos, editor, Visual Communications and Image
Processing, Proc. SPIE 1818, 1522-1530 (1992), and object
motion, see M. Ioka and M. Kurokawa, A Method for
Retrieving Sequences of Images on the Basis of Motion
Analysis, in Image Storage and Retrieval Systems, Proc.
SPIE 1662, 35-46 (1992), and S. Y. Lee and H. M. Kao,
Video Indexing-an approach based on moving object and
track, in Wayne Niblack, editor, Storage and Retrieval for
Image and Video Databases, Proc. SPIE 1908, 25-36
(1993).

Using breakpoints found via scene cut detection, other
researchers have pursued hierarchical segmentation to analyze
the logical organization of video sequences. For more on
this, see the following: G. Davenport, T. Smith, and N.
Pincever, Cinematic Primitives for Multimedia, IEEE Computer
Graphics & Applications, 67-74 (1991); M. Shibata, A
Temporal Segmentation Method for Video Sequences, in
Petros Maragos, editor, Visual Communications and Image
Processing, Proc. SPIE 1818, 1194-1205 (1992); D.
Swanberg, C-F. Shu, and R. Jain, Knowledge Guided Parsing
in Video Databases, in Wayne Niblack, editor, Storage
and Retrieval for Image and Video Databases, Proc. SPIE
1908, 13-24 (1993). In the same way that text is organized
into sentences, paragraphs and chapters, the goal of these
techniques is to determine a hierarchical grouping of video
sub-sequences. Combining this structural information with
content abstractions of segmented sub-sequences provides
multimedia system users a top-down view of video data. For
more details see F. Arman, R. Depommier, A. Hsu, and M.
Y. Chiu, Content-Based Browsing of Video Sequences, in
Proceedings of ACM International Conference on
Multimedia (1994).

Closed-circuit television (CCTV) systems provide security
personnel a wealth of information regarding activity in
both indoor and outdoor domains. However, few tools exist
that provide automated or assisted analysis of video data;
therefore, the information from most security cameras is
under-utilized.

Security systems typically process video camera output
by either displaying the video on monitors for simultaneous
viewing by security personnel and/or recording the data to
time-lapse VCR machines for later playback. Serious
limitations exist in these approaches:

Psycho-visual studies have shown that humans are limited
in the amount of visual information they can process in tasks
like video camera monitoring. After a time, visual activity in
the monitors can easily go unnoticed. Monitoring
effectiveness is additionally taxed when output from multiple
video cameras must be viewed.

Time-lapse VCRs are limited in the amount of data that
they can store in terms of resolution, frames per second, and
length of recordings. Continuous use of such devices
requires frequent equipment maintenance and repair.

In both cases, the video information is unstructured and
un-indexed. Without an efficient means to locate visual
events of interest in the video stream, it is not cost-effective
for security personnel to monitor or record the output from
all available video cameras.

Video motion detectors are the most powerful of available
tools to assist in video monitoring. Such systems detect
visual movement in a video stream and can activate alarms
or recording equipment when activity exceeds a pre-set
threshold. However, existing video motion detectors
typically sense only simple intensity changes in the video
data and cannot provide more intelligent feedback regarding
the occurrence of complex object actions such as inventory
theft.

SUMMARY OF THE INVENTION

In accordance with one embodiment of the present 
invention, a method is provided to perform video indexing 
from object motion. Moving objects are detected in a video
sequence using a motion segmentor. Segmented video
objects are recorded and tracked through successive frames.
The path of the objects and intersection with paths of the
other objects are determined to detect occurrence of events.
An index mark is placed to identify these events of interest
such as appearance/disappearance, deposit/removal,
entrance/exit, and motion/rest of objects.

These and other features of the invention will be
apparent to those skilled in the art from the following
detailed description of the invention, taken together with the
accompanying drawings.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is an overview diagram of a system for automatically
indexing pre-recorded video in accordance with one
embodiment of the present invention;

FIG. 2 is a sequence of frames of video (test sequence 1)
with the frame numbers below each image;

FIG. 3 illustrates points in the video sequence that satisfy
the query "show all deposit events";

FIG. 4 illustrates the relation between video data, motion
segmentation and video meta-information;

FIG. 5 illustrates the Automatic Video Indexing system
architecture;

FIG. 6 illustrates the motion segmentor;

FIG. 7 illustrates motion segmentation example where (a)
is the reference image $I_0$; (b) image $I_n$; (c) absolute
difference $|D_n| = |I_n - I_0|$; (d) threshold image $T_h$;
(e) result of morphological close operation; (f) result of
connected components analysis;

FIG. 8 illustrates reference image from a TV camera
modified to account for the exposed background region;

FIG. 9 illustrates the output of the object tracking stage
for a hypothetical sequence of 1-D frames where vertical
lines labeled "$F_n$" represent frame numbers n and where
primary links are solid lines and secondary links are dashed;

FIG. 10 illustrates an example motion graph for a
sequence of 1-D frames;

FIG. 11 illustrates stems;

FIG. 12 illustrates branches;

FIG. 13 illustrates trails;

FIG. 14 illustrates tracks;

FIG. 15 illustrates traces;

FIG. 16 illustrates indexing rules applied to FIG. 10;

FIG. 17 illustrates a graphical depiction of the query
Y=(C, T, V, R, E);

FIG. 18 illustrates processing of temporal constraints;

FIG. 19 illustrates processing of object based constraints;

FIG. 20 illustrates processing of spatial constraints;

FIG. 21 illustrates processing of event-based constraints;

FIG. 22 illustrates a picture of the "playback" portion of
the GUI;

FIG. 23 illustrates the query interface;

FIG. 24 illustrates video content analysis using advanced
queries with video clips a, b, c and d;

FIG. 25 illustrates frames from test sequence 2;

FIG. 26 illustrates frames from test sequence 3; and

FIG. 27 illustrates video indexing in a real-time system.

DESCRIPTION OF THE PREFERRED
EMBODIMENTS OF THE PRESENT
INVENTION

FIG. 1 shows a high-level diagram of the Automatic
Video Indexing (AVI) system 10 according to one embodiment
of the present invention. In this view, a camera 11
provides input to a vision subsystem 13 including a
programmed computer which processes the incoming video
which has been digitized to populate a database storage 15.
The term camera as used herein may be a conventional
television (TV) camera or infrared (IR) camera. A user may
then analyze the video information using an interface 17
including a computer to the database 15 via spatio-temporal,
event-, and object-based queries. The user interface 17 plays
video subsequences which satisfy the queries to a monitor.

FIG. 2 shows frames from a video sequence with content
similar to that found in security monitoring applications. In
this sequence, a person enters the scene, deposits a piece of
paper, a briefcase, and a book, and then exits. He then
re-enters the scene, removes the briefcase, and exits again.
The time duration of this example sequence is about 1
minute; however, the action could have been spread over a
number of hours. By querying the AVI database 15, a user
can jump to important events without playing the entire
sequence front-to-back. For example, if a user formed the
query "show all deposit events in the sequence", the AVI
system 10 would respond with sub-sequences depicting the
person depositing the paper, briefcase and book. FIG. 3
shows the actual result given by the AVI system in response
to this query where the system points to the placement of the
paper, briefcase and book, and boxes highlight the objects
contributing to the event.

In processing the video data, the AVI vision subsystem 13
employs motion segmentation techniques to segment
foreground objects from the scene background in each frame.
For motion segmentation techniques see S. Yalamanchili, W.
Martin, and J. Aggarwal, Extraction of Moving Object
Descriptions via Differencing, Computer Graphics and
Image Processing, 18, 188-201 (1982); R. Jain, Segmentation
of Frame Sequences obtained by a Moving Observer,
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6, 624-629 (1984); A. Shio and J. Sklansky,
Segmentation of People in Motion, in IEEE Workshop on
Visual Motion, 325-332 (1991); and D. Ballard and C.
Brown, Computer Vision, Prentice-Hall, Englewood Cliffs,
New Jersey (1982). It then analyzes the segmented video to
create a symbolic representation of the foreground objects
and their movement. This symbolic record of video content
is referred to as the video "meta-information" (see FIG. 4).
FIG. 4 shows the progression of the video data frames, the
corresponding motion segmentation and the corresponding
meta-information. This meta-information is stored in the
database in the form of an annotated directed graph
appropriate for later indexing and search. The user
interface 17 operates upon this information rather than the
raw video data to analyze semantic content.

The vision subsystem 13 records in the meta-information
the size, shape, position, time-stamp, and image of each
object in every video frame. It tracks each object through
successive video frames, estimating the instantaneous
velocity at each frame and determining the path of the object
and its intersection with the paths of other objects. It then
classifies objects as moving or stationary based upon
velocity measures on their path.

Finally, the vision subsystem 13 scans through the
meta-information and places an index mark at each occurrence
of eight events of interest: appearance/disappearance,
deposit/removal, entrance/exit, and motion/rest of objects.
This indexing is done using heuristics based on the motion of
the objects recorded in the meta-information. For example, a
moving object that "spawns" a stationary object results in a
"deposit" event. A moving object that intersects and then
removes a stationary object results in a "removal" event.

The system stores the output of the vision subsystem — the 
video data, motion segmentation, and meta-information — in 
the database 15 for retrieval through the user interface 17.
The interface allows the user to retrieve a video sequence of
interest, play it forward or backward and stop on individual
frames. Furthermore, the user may specify queries on a
video sequence based upon spatio-temporal, event-based,
and object-based parameters.

For example, the user may select a region in the scene and
specify the query "show me all objects that are removed
from this region of the scene between 8 am and 9 am." In this
case, the user interface searches through the video
meta-information for objects with timestamps between 8 am
and 9 am, then filters this set for objects within the
specified region that are marked with "removal" event tags.
This results in a set of objects satisfying the user query.
From this set, it then assembles a set of video "clips"
highlighting the query results. The user may select a clip of
interest and proceed with further video analysis using
playback or queries as before.
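
The two-stage filtering described above (timestamps first, then region and event tag) can be sketched as follows. This is only an illustration under stated assumptions: the `ObjectRecord` fields, the rectangular region format, the use of the object centroid for the region test, and the timestamp unit (seconds since midnight) are all hypothetical and are not taken from the AVI system.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    """Hypothetical meta-information entry for one object in one frame."""
    object_id: int
    frame: int
    timestamp: float   # seconds since midnight (assumed unit)
    centroid: tuple    # (row, column)
    events: set = field(default_factory=set)


def in_region(centroid, region):
    """True if the centroid lies inside region = (top, left, bottom, right)."""
    top, left, bottom, right = region
    row, col = centroid
    return top <= row <= bottom and left <= col <= right


def query(records, t_start, t_end, region, event):
    """Filter by time window first, then by region and event tag."""
    in_time = [r for r in records if t_start <= r.timestamp <= t_end]
    return [r for r in in_time
            if in_region(r.centroid, region) and event in r.events]


records = [
    ObjectRecord(1, 120, 8.5 * 3600, (40, 60), {"removal"}),
    ObjectRecord(2, 300, 8.7 * 3600, (40, 200), {"removal"}),  # outside region
    ObjectRecord(3, 900, 10.0 * 3600, (45, 62), {"removal"}),  # outside time window
    ObjectRecord(4, 150, 8.6 * 3600, (42, 58), {"deposit"}),   # different event
]
hits = query(records, 8 * 3600, 9 * 3600,
             region=(30, 50, 60, 80), event="removal")
print([r.object_id for r in hits])  # [1]
```

Each surviving record carries a frame number, so a clip highlighting the result could be assembled around that frame.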

The following is a description of some of the terms and 
notation used in the remainder of this application. 

A sequence S is an ordered set of N frames, denoted
$S = \{F_0, F_1, \ldots, F_{N-1}\}$, where $F_n$ is frame
number n in the sequence.

A clip is a 4-tuple $C = (S, f, s, l)$, where S is a sequence
with N frames, and f, s, and l are frame numbers such that
$0 \le f \le s \le l \le N-1$. Here, $F_f$ and $F_l$ are the
first and last valid frames in the clip, and $F_s$ is the
current frame. Thus, a clip specifies a sub-sequence with a
state variable to indicate a "frame of interest".

A frame F is an image I annotated with a timestamp t.
Thus, frame number n is denoted by the pair $F_n = (I_n, t_n)$.

An image I is an r by c array of pixels. The notation
$I(i, j)$ indicates the pixel at coordinates (row i, column j)
in I, wherein $i = 0, \ldots, r-1$ and $j = 0, \ldots, c-1$.
For purposes of this discussion, a pixel is assumed to be an
intensity value between 0 and 255.
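
As an illustration of this notation, the definitions map onto simple data types. The sketch below is an assumption about one possible representation; the class and field names are hypothetical, and only the constraint $0 \le f \le s \le l \le N-1$ comes from the text.

```python
from dataclasses import dataclass
from typing import List

# An image is an r-by-c array of intensity values in 0..255.
Image = List[List[int]]


@dataclass
class Frame:
    """A frame F_n = (I_n, t_n): an image annotated with a timestamp."""
    image: Image
    timestamp: float


@dataclass
class Clip:
    """A clip C = (S, f, s, l) over a sequence S of N frames."""
    sequence: List[Frame]
    first: int     # f, first valid frame number
    current: int   # s, the "frame of interest"
    last: int      # l, last valid frame number

    def __post_init__(self):
        n = len(self.sequence)
        if not (0 <= self.first <= self.current <= self.last <= n - 1):
            raise ValueError("require 0 <= f <= s <= l <= N-1")

    def frames(self):
        """Return the sub-sequence F_f .. F_l (inclusive)."""
        return self.sequence[self.first:self.last + 1]


frames = [Frame(image=[[0, 0], [0, 0]], timestamp=float(n)) for n in range(5)]
clip = Clip(frames, first=1, current=2, last=3)
print(len(clip.frames()))  # 3
```

Keeping the current frame as a field of the clip mirrors the "state variable" role described above: playback can advance it without changing the clip boundaries.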

FIG. 5 shows the AVI system in detail. Note that the 
motion segmentor 21, object tracker 22, motion analyzer 23, 
recorder 24, and compressor 25 comprise the vision subsystem
13 of FIG. 1. Likewise, the query engine 27,
graphical user interface 28, playback device 29 and
decompression modules 30 comprise the user interface 17. The
subsequent paragraphs describe each of these components in 
detail. 

The current implementation of the AVI system supports 
batch, rather than real-time, processing. Therefore, frames 
are digitized into a temporary storage area 20 before further 
processing occurs. A real-time implementation would 
bypass the temporary storage 20 and process the video in a 
pipelined fashion. 

FIG. 6 shows the motion segmentor in more detail. For
each frame $F_n$ in the sequence, the motion segmentor 21
computes segmented image $C_n$ as

$$C_n = \mathrm{ccomps}(T_h \bullet k),$$

where $T_h$ is the binary image resulting from thresholding
the absolute difference of images $I_n$ and $I_0$ at h,
$T_h \bullet k$ is the morphological close operation on $T_h$
with structuring element k, and the function ccomps(·)
performs connected components analysis resulting in a unique
label for each connected region in image $T_h \bullet k$. The
image $T_h$ is defined as

$$T_h(i, j) = \begin{cases} 1 & \text{if } |D_n(i, j)| \ge h \\ 0 & \text{otherwise} \end{cases}$$

for all pixels (i, j) in $T_h$, where $D_n$ is the difference
image of $I_n$ and $I_0$ such that

$$D_n(i, j) = I_n(i, j) - I_0(i, j).$$

For noisy data (such as from an infrared camera), the image
$D_n$ may be smoothed via a low-pass filter to create a more
consistent difference image.

Finally, the operation $a \bullet k$ is defined as

$$a \bullet k = (a \oplus k) \ominus k,$$

where $\oplus$ is the morphological dilation operator and
$\ominus$ is the morphological erosion operator.

FIG. 7 shows an example of this process. FIG. 7a is the
reference image $I_0$; FIG. 7b is the image $I_n$; FIG. 7c is
the absolute difference $|D_n| = |I_n - I_0|$; FIG. 7d is the
thresholded image $T_h$, which highlights motion regions in
the image; FIG. 7e is the result of the morphological close
operation, which joins together small regions into smoothly
shaped objects; FIG. 7f is the result of connected components
analysis, which assigns each detected object a unique label
such as regions 1-4. This result is $C_n$, the output of the
motion segmentor.
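
The pipeline of thresholding, morphological close, and connected components labeling can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the 3x3 square structuring element, the clipping of windows at the image border, and the use of 8-connectivity are assumptions, and the function names (other than `ccomps`, which follows the text) are hypothetical.

```python
from collections import deque


def threshold(image, reference, h):
    """T_h(i, j) = 1 where |I_n(i, j) - I_0(i, j)| >= h, else 0."""
    return [[1 if abs(a - b) >= h else 0 for a, b in zip(row, ref_row)]
            for row, ref_row in zip(image, reference)]


def _morph(binary, combine):
    """Apply combine (any = dilation, all = erosion) over 3x3 windows.

    Windows are clipped at the image border (an assumed convention).
    """
    rows, cols = len(binary), len(binary[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [binary[y][x]
                      for y in range(max(i - 1, 0), min(i + 2, rows))
                      for x in range(max(j - 1, 0), min(j + 2, cols))]
            out[i][j] = 1 if combine(window) else 0
    return out


def close(binary):
    """Morphological close: dilation followed by erosion."""
    return _morph(_morph(binary, any), all)


def ccomps(binary):
    """Label 8-connected foreground regions 1, 2, ... in raster order."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for i in range(rows):
        for j in range(cols):
            if binary[i][j] and not labels[i][j]:
                next_label += 1
                labels[i][j] = next_label
                queue = deque([(i, j)])
                while queue:  # breadth-first flood fill of one region
                    y, x = queue.popleft()
                    for ny in range(max(y - 1, 0), min(y + 2, rows)):
                        for nx in range(max(x - 1, 0), min(x + 2, cols)):
                            if binary[ny][nx] and not labels[ny][nx]:
                                labels[ny][nx] = next_label
                                queue.append((ny, nx))
    return labels


def segment(image, reference, h):
    """C_n = ccomps(T_h closed with a 3x3 element)."""
    return ccomps(close(threshold(image, reference, h)))


reference = [[10] * 11 for _ in range(5)]
image = [row[:] for row in reference]
for col in (2, 4, 8):          # three changed pixels on row 2
    image[2][col] = 200
labels = segment(image, reference, h=50)
print(labels[2])  # [0, 0, 1, 1, 1, 0, 0, 0, 2, 0, 0]
```

In this toy input the close operation fills the one-pixel gap between columns 2 and 4, so those pixels receive a single label, while the isolated pixel at column 8 becomes a second region.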

Note that the technique uses a "reference image" for 
processing. This is nominally the first image from the 
sequence, $I_0$. For many applications, the assumption of an
available reference image is not unreasonable; video capture 
is simply initiated from a fixed-viewpoint camera when 
there is limited motion in the scene. Following are some 
reasons why this assumption may fail in other applications: 

1. Gradual lighting changes may cause the reference 
frame to grow "out of date" over long video sequences, 
particularly in outdoor scenes. Here, more sophisti- 
cated techniques involving cumulative differences of 
successive video frames must be employed. 

2. The viewpoint may change due to camera motion. In 
this case, camera motion compensation must be used to 
"subtract" the moving background from the scene. 

3. An object may be present in the reference frame and
move during the sequence. This causes the motion
segmentation process to incorrectly detect the background
region exposed by the object as if it were a
newly-appearing stationary object in the scene.

A straightforward solution to problem 3 is to apply a test 
to non-moving regions detected by the motion segmentation 
process to determine if a given region is the result of either 
(1) a stationary object present in the foreground or (2) 
background exposed by a foreground object present in the 
reference image. 

In the case of video data from a TV camera, this test is
implemented based on the following observation: if the
region detected by the segmentation of image $I_n$ is due to
the motion of an object present in the reference image (i.e.,
due to "exposed background"), a high probability exists that
the boundary of the segmented region will match intensity
edges detected in $I_0$. If the region is due to the presence
of an object in the current image, a high probability exists
that the region boundary will match intensity edges in $I_n$.
The test is implemented by applying an edge detection
operator (See D.
Ballard and C. Brown, Computer Vision, Prentice-Hall,
Englewood Cliffs, N.J., 1982) to the current and reference
images and checking for coincident boundary pixels in the
segmented region of $C_n$.

In the case of video data from an IR camera, foreground
objects may not have easily detectable edges due to heat
diffusion and image blurring. In data from some cameras,
however, objects exhibit a contrasting halo due to
opto-mechanical image sharpening. See A. Rosenfeld and A. Kak,
Digital Picture Processing, 2 ed., Volume 1, Academic Press,
New York, N.Y., 1982. Thus, the test may be implemented
by comparing the variance of pixel intensities within the
region of interest in the two images. Since background
regions tend to exhibit constant pixel intensities, the
variance will be highest for the image containing the
foreground object.

The object detection method for scene change analysis in
TV and IR data is described in above cited application of
Courtney, et al. incorporated herein by reference and in
Appendix A.

If either test supports the hypothesis that the region in
question is due to exposed background, the reference image
is modified by replacing the object with its exposed
background region (see FIG. 8).

No known motion segmentation technique is perfect. The
following are errors typical of many motion segmentation
techniques:

1. True objects will disappear temporarily from the
motion segmentation record. This occurs when there is
insufficient contrast between an object and an occluded
background region, or if an object is partially occluded
by a "background" structure (for instance, a tree or
pillar present in the scene).

2. False objects will appear temporarily in the motion
segmentation record. This is caused by light
fluctuations or shadows cast by moving objects.

3. Separate objects will temporarily join together. This
typically occurs when two or more objects are in close
proximity or one object occludes another object.

4. Single objects will split into two regions and then
rejoin. This occurs when a portion of an object has
insufficient contrast with the background it occludes.

Instead of applying incremental improvements to relieve
the shortcomings of motion segmentation, the AVI technique
addresses these problems at a higher level where information
about the semantic content of the video data is more
readily available. The object tracker and motion analyzer
units described later employ object trajectory estimates and
domain knowledge to compensate for motion segmentation
inaccuracies and thereby construct a more accurate record of
the video content.

The motion segmentor 21 output is processed by the
object tracker 22. Given a segmented image $C_n$ with P
uniquely-labeled regions corresponding to foreground
objects in the video, the system generates a set of features to
represent each region. This set of features is named a
"V-object" (video-object), denoted $V_n^p$, $p = 1, \ldots, P$.
A V-object contains the label, centroid, bounding box, and

determined using linear prediction of V-object positions and
a "mutual nearest neighbor" criterion via the following
procedure:

1. For each V-object $V_n^p \in V_n$, predict its position in
the next frame using

$$\tilde{\mu}_n^p = \mu_n^p + v_n^p \cdot (t_{n+1} - t_n),$$

where $\tilde{\mu}_n^p$ is the predicted centroid of $V_n^p$
in $C_{n+1}$, $\mu_n^p$ is the centroid of $V_n^p$ measured in
$C_n$, $v_n^p$ is the estimated (forward) velocity of $V_n^p$,
and $t_{n+1}$ and $t_n$ are the timestamps of frames $F_{n+1}$
and $F_n$, respectively. Initially, the velocity estimate is
set to $v_0^p = (0, 0)$.

2. For each $V_n^p \in V_n$, determine the V-object in the next
frame with centroid nearest $\tilde{\mu}_n^p$. This "nearest
neighbor" is denoted $\mathcal{N}_n^p$. Thus,

$$\mathcal{N}_n^p = V_{n+1}^r \ni \|\tilde{\mu}_n^p - \mu_{n+1}^r\| \le \|\tilde{\mu}_n^p - \mu_{n+1}^q\| \quad \forall q \ne r.$$

3. For every pair $(V_n^p, \mathcal{N}_n^p = V_{n+1}^r)$ for
which no other V-objects in $V_n$ have $V_{n+1}^r$ as a nearest
neighbor, estimate $v_{n+1}^r$, the (forward) velocity of
$V_{n+1}^r$, as

$$v_{n+1}^r = \frac{\mu_{n+1}^r - \mu_n^p}{t_{n+1} - t_n}; \qquad (1)$$

otherwise, set $v_{n+1}^r = (0, 0)$.

These steps are performed for each $C_n$, $n = 0, 1, \ldots, N-2$.
Steps 1 and 2 find nearest neighbors in the subsequent frame
for each V-object. Step 3 generates velocity estimates for
V-objects that can be unambiguously tracked; this information
is used in step 1 to predict V-object positions for the next
frame.

Next, steps 1-3 are repeated for the reverse sequence, i.e.,
$C_n$, $n = N-1, N-2, \ldots, 1$. This results in a new set of
predicted centroids, velocity estimates, and nearest
neighbors for each V-object in the reverse direction. Thus, the
V-objects are tracked both forward and backward through
the sequence. The remaining steps are then performed:

4. V-objects $V_n^p$ and $V_{n+1}^r$ are mutual nearest
neighbors if $\mathcal{N}_n^p = V_{n+1}^r$ and
$\mathcal{N}_{n+1}^r = V_n^p$. (Here, $\mathcal{N}_n^p$ is the
nearest neighbor of $V_n^p$ in the forward direction, and
$\mathcal{N}_{n+1}^r$ is the nearest neighbor of $V_{n+1}^r$ in
the reverse direction.) For each pair of mutual nearest
neighbors $(V_n^p, V_{n+1}^r)$, create a primary link from
$V_n^p$ to $V_{n+1}^r$.

5. For each $V_n^p \in V_n$ without a mutual nearest neighbor,
create a secondary link from $V_n^p$ to $\mathcal{N}_n^p$ if
the predicted centroid $\tilde{\mu}_n^p$ is within $\epsilon$
of $\mathcal{N}_n^p$, where $\epsilon$ is some small distance.

6. For each $V_{n+1}^r$ in $V_{n+1}$ without a mutual nearest
neighbor, ...

The object tracking procedure uses the mutual nearest
neighbor criterion (step 4) to estimate frame-to-frame
V-object trajectories with a high degree of confidence. Pairs
of mutual nearest neighbors are connected using a "primary"
link to indicate that they are highly likely to represent the

shape mask of its corresponding region, as well as object same real-world object in successive video frames, 

velocity and trajectory information by the tracking process. 60 Steps 5-6 associate V-objects that are tracked with less 

V-objects are then tracked through the segmented video confidence but display evidence that they might result from 

sequence. Given segmented images C„ and C n ^ with the same real-world object. Thus, these objects are joined by 

V-objects V^jV^; p=l, . . . ,P} and V W4>1 «{V n+1 q ; "secondary" links. These steps are necessary to account for 

q=l, . . . ,Q}, respectively, the motion tracking process the "split" and "join" type motion segmentation errors as 

"links" V-objects W n p and V n+1 9 if their position and esti- 65 described above. 

mated velocity indicate that they correspond to the same The object tracking process results in a list of V-objects 

real-world object appearing in frames F n and F n + V This is and connecting links that form a directed graph (digraph) 
representing the position and trajectory of foreground
objects in the video sequence. Thus, the V-objects are the
nodes of the graph and the connecting links are the arcs. This
motion graph is the output of the object tracker.

FIG. 9 shows a motion graph for a hypothetical sequence
of one-dimensional frames. Here, the system detects the
appearance of an object at A and tracks it to the V-object at
B. Due to an error in motion segmentation, the object splits
at D and E, and joins at F. At G, the object joins with the
object tracked from C due to occlusion. These objects split
at H and I. Note that primary links connect the V-objects that
were most reliably tracked.

The motion analyzer 23 analyzes the results of the object
tracker and annotates the motion graph with index marks
describing several events of interest. This process proceeds
in two parts: V-object grouping and V-object indexing. FIG.
10 shows an example motion graph for a hypothetical
sequence of 1-D frames discussed in the following sections.

First, the motion analyzer hierarchically groups V-objects
into structures representing the paths of objects through the
video data. Using graph theory terminology (see G. Chartrand
and O. Oellermann, Applied and Algorithmic Graph
Theory, McGraw-Hill, New York (1993)), five groupings are
defined for this purpose:

A stem $M=\{V_i : i=1,2,\ldots,N_M\}$ is a maximal-size,
directed path (dipath) of two or more V-objects containing
no secondary links, meeting all of the following conditions:

$$\mathrm{outdegree}(V_i) \le 1 \ \text{for}\ 1 \le i < N_M,$$

$$\mathrm{indegree}(V_i) \le 1 \ \text{for}\ 1 < i \le N_M,$$

and either

$$\mu_1 = \mu_2 = \cdots = \mu_{N_M} \qquad (2)$$

or

$$\mu_1 \ne \mu_2 \ne \cdots \ne \mu_{N_M} \qquad (3)$$

where $\mu_i$ is the centroid of V-object $V_i \in M$.

Thus, a stem represents a simple trajectory of a stationary
object through two or more frames. FIG. 11 labels V-objects
from FIG. 10 belonging to separate stems with the letters
"A" through "J".

Stems are used to determine the "state" of real-world
objects, i.e. whether they are moving or stationary. If
Equation 2 is true, then the stem is classified as stationary; if
Equation 3 is true, then the stem is classified as moving. FIG.
11 highlights stationary stems; the remainder are moving.

A branch $B=\{V_i : i=1,2,\ldots,N_B\}$ is a maximal-size dipath
of two or more V-objects containing no secondary links, for
which $\mathrm{outdegree}(V_i) \le 1$ for $1 \le i < N_B$ and
$\mathrm{indegree}(V_i) \le 1$ for $1 < i \le N_B$. FIG. 12 labels
V-objects belonging to branches with the letters "K" through
"S". A branch represents a highly reliable trajectory estimate
of an object through a series of frames.

If a branch consists entirely of a single stationary stem,
then it is classified as stationary; otherwise, it is classified as
moving. Branches "NT and "P" in FIG. 12 (highlighted) are
stationary; the remainder are moving.

A trail L is a maximal-size dipath of two or more V-objects
that contains no secondary links. This grouping represents
the object tracking stage's best estimate of an object
trajectory using the mutual nearest neighbor criterion. FIG. 13
labels V-objects belonging to trails with the letters "T"
through "Y".

A trail and the V-objects it contains are classified as
stationary if all the branches it contains are stationary, and
moving if all the branches it contains are moving.
Otherwise, the trail is classified as unknown. Trails "V" and
"X" in FIG. 13 are stationary; the remainder are moving.

A track $K=\{L_1, G_1, \ldots, L_{N_K-1}, G_{N_K-1}, L_{N_K}\}$ is a
dipath of maximal size containing trails
$\{L_i : 1 \le i \le N_K\}$, and connecting dipaths
$\{G_i : 1 \le i < N_K\}$. For each $G_i \in K$ there must
exist a dipath

$$H = \{V_i^l, G_i, V_{i+1}^1\}$$

(where $V_i^l$ is the last V-object in $L_i$, and $V_{i+1}^1$ is the
first V-object in $L_{i+1}$), such that every $V_j \in H$ meets the
requirement

$$\mu_i^l + \upsilon_i^l\,(t_j - t_i^l) \approx \mu_j \qquad (4)$$

where $\mu_i^l$ is the centroid of $V_i^l$, $\upsilon_i^l$ is the
forward velocity of $V_i^l$, $(t_j - t_i^l)$ is the time difference
between the frames containing $V_j$ and $V_i^l$, and $\mu_j$ is the
centroid of $V_j$. Thus, Equation 4 specifies that the estimated
trajectory of V-object $V_i^l$ must intersect every $V_j$ on
path H.

A track represents the trajectory estimate of an object that
may undergo occlusion by a moving object one or more
times in a sequence. The motion analyzer uses Equation 4 to
attempt to follow an object through frames where it is
occluded. FIG. 14 labels V-objects belonging to tracks with
the letters α, β, χ, δ, and ε. Note that track "χ" joins trails
"V" and "X".

A track and the V-objects it contains are classified as
stationary if all the trails it contains are stationary, and
moving if all the trails it contains are moving. Otherwise, the
track is classified as unknown. Track "χ" in FIG. 14 is
stationary; the remaining tracks are moving.

A trace is a maximal-size, connected subdigraph of
V-objects. A trace represents the complete trajectory of an
object and all the objects with which it intersects. Thus, the
motion graph in FIG. 10 contains two traces: one trace
extends from $F_2$ to $F_7$; the remaining V-objects form a
second trace. FIG. 15 labels V-objects on these traces with
the numbers "1" and "2", respectively.

Note that the preceding groupings are hierarchical, i.e.,
for every trace E, there exists at least one track K, trail L,
branch B, and stem M such that
$E \supseteq K \supseteq L \supseteq B \supseteq M$.
Furthermore, every V-object is a member of exactly one
trace.

The motion analyzer scans the motion graph generated by
the object tracker and groups V-objects into stems, branches,
trails, tracks, and traces. Thus, these five definitions are used
to characterize object trajectories in various portions of the
motion graph. This information is then used to index the
video according to its object motion content.

Eight events of interest are defined to designate various
motion events in a video sequence.

Appearance: An object emerges in the scene.

Disappearance: An object disappears from the scene.

Entrance: A moving object enters the scene.

Exit: A moving object exits from the scene.

Deposit: An inanimate object is added to the scene.

Removal: An inanimate object is removed from the scene.

Motion: An object at rest begins to move.

Rest: A moving object comes to a stop.

These eight events are sufficiently broad for a video
indexing system to assist the analysis of many sequences
containing multiple moving objects.

For example, valuable objects such as inventory boxes,
tools, computers, etc., can be monitored for theft (i.e.,
removal) in a security monitoring application.
Likewise, the traffic patterns of automobiles can be analyzed
(e.g., entrance/exit and motion/rest), or the shopping
patterns of retail customers recorded (e.g., motion/rest and
removal).

After the V-object grouping process is complete, the
motion analyzer has all the semantic information necessary
to identify these eight events in a video sequence. For each
V-object V in the graph, the following rules are applied:

1. If V is moving, the first V-object in a track (i.e., the
"head"), and indegree(V)>0, place an index mark
designating an appearance event at V.

2. If V is stationary, the head of a track, and
indegree(V)=0, place an index mark designating an appearance
event at V.

3. If V is moving, the last V-object in a track (i.e., the
"tail"), and outdegree(V)>0, place a disappearance
index mark at V.

4. If V is stationary, the tail of a track, and outdegree(V)=0,
place a disappearance index mark at V.

5. If V is non-stationary (i.e. moving or unknown), the
head of a track, and indegree(V)=0, place an entrance
index mark at V.

6. If V is non-stationary, the tail of a track, and
outdegree(V)=0, place an exit index mark at V.

7. If V is stationary, the head of a track, and
indegree(V)=1, place a deposit index mark at V.

8. If V is stationary, the tail of a track, and
outdegree(V)=1, place a removal index mark at V.

Rules 1-8 use track groupings to index the video at the
beginning and end of individual object trajectories. Note,
however, that rules 7 and 8 only account for the object
deposited or removed from the scene; they do not index the
V-object that caused the deposit or remove event to occur.
For this purpose, we define two additional events

Depositor: A moving object adds an inanimate object to
the scene

Remover: A moving object removes an inanimate object
from the scene

and apply two more rules:

9. If V is adjacent to a V-object with a deposit index, place
a depositor index mark at V.

10. If V is adjacent from a V-object with a removal index,
place a remover index mark at V.

Rules 9-10 provide a distinction between the subject and
object of deposit/remove events. This distinction is only
necessary when the subject and object of the event must be
distinguished. Otherwise, depositor/remover events are
treated identically to deposit/remove events.

Finally, the indexing process applies rules to account for
the start and stop events:

11. If V is the tail of a stationary stem $M_i$ and the head of
a moving stem $M_j$ for which $|M_i| \ge h_M$ and $|M_j| \ge h_M$,
then place a motion index mark at V. Here, $h_M$ is a
lower size limit of stems to consider.

12. If V is the tail of a moving stem $M_i$ and the head of
a stationary stem $M_j$ for which $|M_i| \ge h_M$ and
$|M_j| \ge h_M$, then place a rest index mark at V.

The output of the motion analyzer 23 is a directed graph
describing the motion of foreground objects annotated with
object-based index marks indicating events of interest in the
video stream. Thus, the motion analyzer 23 generates from
the motion segmentation data a symbolic abstraction of the
actions and interactions of foreground objects in the video.
This approach enables content-based navigation and analysis
of video sequences that would otherwise be impossible.

FIG. 16 shows all the indexing rules applied to the
example motion graph of FIG. 10. Note that despite the
occlusion of the stationary object at frame $F_8$, the technique
correctly places a single pair of "deposit" and "removal"
indices at frames $F_3$ and $F_{12}$, respectively.

The recorder writes the video stream and meta-information
into the video database for later retrieval. Since
the meta-information record lends itself to an object-oriented
representation, we experimented using the ARPA
Open Object-Oriented Database (OODB) developed at
Texas Instruments. (See David L. Wells, et al., "Architecture
of an Open Object-Oriented Database Management
System," IEEE Computer, pp. 74-82, October 1992.) The
Open OODB allows straightforward storage and retrieval of
the meta-information in an object-oriented fashion. The
input video data may also be stored in the Open OODB on
a frame-by-frame basis; however, we found it most efficient
to simply record the incoming video to a "flat" file
referenced by objects in the OODB.

Optionally, the video meta-information can also be used
to compress the video data for maximum storage efficiency.
Recall that each V-object records a shape mask of its
real-world object obtained from motion segmentation. Since
the motion segmentation process captures the salient object
motion in the video, the input data may be compressed
substantially by recording this information to the video
database rather than the entire video sequence.

In this case, the reference frame, $F_0$, is recorded in
compressed form, perhaps using the JPEG still-picture
compression standard. Then, information describing individual
objects relative to the reference frame is recorded: the
position and shape of the V-object region mask and its
corresponding image data. The mask is efficiently run-length
encoded; the V-object image data is then JPEG-encoded as
well. On playback, the system first decodes $F_0$, then decodes
the V-object image data for each subsequent frame and maps
these onto the reference frame using the V-object region
masks. Using such a storage scheme, significant amounts of
video can be stored on conventional magnetic disks at
compression ratios of 30- to 250-to-1.

The AVI query engine retrieves video data from the
database in response to queries generated at the graphical
user interface. A valid query Y takes the form

Y=(C, T, V, R, E),

where

C is a video clip,

T=($t_i$, $t_j$) specifies a time interval within the clip,

V is a V-object within the clip meta-information,

R is a spatial region in the field of view, and

E is an object-motion event.

The clip C specifies the video sub-sequence to be processed
by the query, and the (optional) values of T, V, R, and
E define the scope of the query. Using this form, the AVI
system user can make such a request as "find any occurrence
of this object being removed from this region of the scene
between 8am and 9am." Thus, the query engine processes Y
by finding all the video sub-sequences in C that satisfy T, V,
R, and E.

In processing a given query, the query engine retrieves the
V-object graph G corresponding to clip C from the video
database, and performs the following steps:

1. If T=($t_i$, $t_j$) is specified in the query, G is truncated to
a subgraph spanning frames $F_i$ to $F_j$.

2. If V is specified, G is further truncated to include only
the trace containing V.






3. If V belongs to a track, G is further truncated to
include only the track containing V.

4. If R is specified, G is truncated to include only those
V-objects whose shape masks intersect the specified
spatial region.

5. If E is specified, G is truncated to include only those
V-objects with event indices matching E.

6. If E is not specified, G is truncated to include only those
V-objects V with indegree(V)=0, i.e., the source nodes
in G. This reduces the result to include only the first
occurrence of real-world objects meeting the requirements
of V, T, and R.
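
The six query steps can be sketched as successive filters over the V-object graph. The Python below is a minimal illustration under stated assumptions, not the AVI implementation: the `VObject` fields, the `Query` container, the use of an axis-aligned rectangle in place of a shape mask and of the region R, and the `(source, target)` link list are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# (x0, y0, x1, y1); a rectangle stands in for a shape mask or region R.
Rect = tuple[int, int, int, int]


@dataclass(frozen=True)
class VObject:
    vid: str                    # hypothetical node identifier
    frame: int                  # frame number containing this V-object
    trace: int                  # label of the trace the node belongs to
    track: Optional[str]        # label of its track, if any
    bbox: Rect
    events: frozenset = frozenset()   # index marks placed on this node


@dataclass
class Query:
    T: Optional[tuple[int, int]] = None   # (first frame, last frame)
    V: Optional[VObject] = None
    R: Optional[Rect] = None
    E: Optional[str] = None


def overlaps(a: Rect, b: Rect) -> bool:
    """True when two rectangles share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def run_query(nodes, links, q):
    """Apply steps 1-6 in order; `links` is a list of (src_vid, dst_vid)."""
    g = list(nodes)
    if q.T is not None:                                   # step 1
        g = [v for v in g if q.T[0] <= v.frame <= q.T[1]]
    if q.V is not None:                                   # step 2
        g = [v for v in g if v.trace == q.V.trace]
        if q.V.track is not None:                         # step 3
            g = [v for v in g if v.track == q.V.track]
    if q.R is not None:                                   # step 4
        g = [v for v in g if overlaps(v.bbox, q.R)]
    if q.E is not None:                                   # step 5
        g = [v for v in g if q.E in v.events]
    else:                                                 # step 6
        kept = {v.vid for v in g}
        has_parent = {d for s, d in links if s in kept and d in kept}
        g = [v for v in g if v.vid not in has_parent]     # source nodes of G
    return g
```

In this sketch the indegree test of step 6 is evaluated on the truncated graph, so a node whose only predecessor was filtered out counts as a source node; that reading of "the source nodes in G" is an interpretation.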

Thus, step 1 satisfies the temporal query constraints; steps 
2 and 3 satisfy the object-based constraints by restricting the 
search to the most reliable path of V in the motion graph; 
step 4 filters V-objects to meet the spatial constraints; and 
steps 5 and 6 filter V-objects to match the specified event. 
The resulting graph G then contains only V-objects satisfy- 
ing all the constraints of the query. 

FIG. 17 is a graphical depiction of a query Y=(C, T, V, R,
E) applied to the V-object graph of FIG. 10; i.e., "show if
object V exits the scene in region R during the time interval
T". FIGS. 18-21 illustrate the steps performed by the query
engine on this sequence.

Finally, for each V-object $V_i$ satisfying the query, the
query engine generates a result, $R_i=(C_i, V_i)$, consisting of a
clip, $C_i$, and a pointer to the V-object. The first and last
frames of $C_i$ are set to reflect the time constraint of the query,
T, if specified; otherwise, they are set to those of C, the clip
specified in the query. The "frame of interest" of $C_i$ is set to
the frame containing $V_i$. These results are sent to the
graphical user interface for display.
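
The construction of each result pair can be illustrated in a few lines. This is a hedged sketch only: the `Clip` type, its field names, and the representation of a matching V-object as an `(identifier, frame)` pair are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clip:
    first: int                               # first frame of the clip
    last: int                                # last frame of the clip
    frame_of_interest: Optional[int] = None


def make_results(clip, matches, T=None):
    """Build one (clip, V-object id) pair per matching V-object.

    `matches` is a list of (v_object_id, frame) pairs. The clip bounds come
    from the time constraint T when it is given, otherwise from `clip`.
    """
    first, last = T if T is not None else (clip.first, clip.last)
    return [(Clip(first, last, frame), vid) for vid, frame in matches]
```

For instance, a match at frame 78 inside a 300-frame clip queried with T=(50, 120) yields a result clip spanning frames 50 to 120 whose frame of interest is 78.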

A graphical user interface (GUI) 28 enables users to
analyze video sequences via spatial, temporal, event-, and
object-based query processing. FIG. 22 shows a picture of
the "playback" portion of the GUI. The interface allows the
user to select video clips for analysis and play them using
VCR-like controls (i.e., forward, reverse, stop, step-forward,
step-back). The GUI 28 also provides a system "clipboard"
for recording intermediate analysis results. For example, the
clipboard shown in FIG. 23 contains three clips, the result of
a previous query by the user. The user may select one of
these clips and pose a query using it. The resulting clip(s)
would then be pushed onto the top of the clipboard stack.
The user may peruse the stack using the button-commands
"up", "down", and "pop".

FIG. 25 shows the query interface to the AVI system.
Using this interface, the user may pose full queries of the
form Y=(C, T, V, R, E) as described above. Using the "Type"
field, the user may specify any combination of the four query
types. The query interface provides fields to set parameters
for temporal and event-based queries; parameters for spatial
and object-based queries may be set using the mouse inside
the video playback window shown in FIG. 24. After
specifying the query type and parameters, the user executes the
"Apply" button-command to pose the query to the AVI
system. The resulting clips are then posted to the system
clipboard.

FIG. 2 shows frames from an example video sequence with
motion content characteristic of security monitoring
applications. In this sequence, a person enters the scene, deposits
a piece of paper, a briefcase, and a book, and then exits. He
then re-enters the scene, removes the briefcase, and exits
again. If a user forms the query "find all deposit events", the
AVI system will respond with video clips depicting the
person depositing the paper, briefcase and book. FIG. 3
shows the actual result given by the AVI system in response
to this query.
FIG. 24 demonstrates how more complex queries can be
applied. After receiving the three clips of FIG. 3 in response
to the query "show all deposit events", the AVI system user
is interested in learning more about the fate of the briefcase in
the sequence of FIG. 3. First, the user retrieves the clip
highlighting frame $F_{78}$ (shown in FIG. 24(a)) from the
clipboard and applies the query "find entrance events of this
object" to the person shown depositing the briefcase. The
system responds with a single clip showing the first instance
of the person entering the scene, as shown in FIG. 24(b). The
user can play the clip at this point and observe the person
carrying the briefcase into the room.

Next, the user applies the query "find removal events 
(caused by) this object" to the person carrying the briefcase. 
The system responds by saying there are no such events. 
(Indeed, this is correct because the person removes no 
objects until after he leaves and re-enters the room — at that 
point, the person is defined as a different object.) 

The user then returns to the original clip FIG. 24(a) by 
popping the clipboard stack twice. Then the user applies the 
query "find removal events of this object" to the briefcase. 
The system responds with a single clip of the second 
instance of the person removing the briefcase, as shown in 
FIG. 24(c). 

Finally, the user specifies the query "find exit events of 
this object" to the person removing the briefcase. The 
system then responds with the single clip of the person as he 
leaves the room (with the briefcase), as shown in FIG. 24(d). 

The video indexing technique described in this application
was tested using the AVI system on three video
sequences containing a total of 900 frames, 18 objects, and
44 events. The sequences were created as mock-ups of
different domains of scene monitoring.

Test Sequence 1 (i.e., the "table" sequence) is character- 
istic of an inventory or security monitoring application (see 
FIG. 2). In it, a person adds and removes various objects 
from a room as recorded by an overhead camera. It contains 
300 frames captured at approximately 10 frames per second 
and 5 objects generating 10 events. The sequence contains 
entrance/exit and deposit/removal events, as well as two 
instances of object occlusion. 

Test Sequence 2 (the "toys" sequence) is characteristic of
a retail customer monitoring application (see FIG. 25). In it,
a customer stops at a store shelf, examines different
products, and eventually takes one with him. It contains 285
frames at approximately 10 frames per second and 4 objects
generating 14 events. This is the most complicated of the test
sequences: it contains examples of all eight events, displays
several instances of occlusion, and contains three
foreground objects in the initial frame.

Test Sequence 3 (the "park" sequence) is characteristic of
a parking lot traffic monitoring application (see FIG. 26). In
it, cars enter the parking lot and stop, drivers emerge from
their vehicles, and pedestrians walk through the field of
view. It contains 315 frames captured at approximately 3
frames per second and 9 objects generating 20 events.
Before digitization, the sequence was first recorded to 8
mm tape with consumer-grade equipment and is therefore
the most "noisy" of the test sequences.

The performance of the AVI system was measured by
indexing each of the test sequences and recording its success
or failure at detecting each of the eight event indices. Tables
1-3 report event detection results for the AVI system on the
three test sequences. For each event, the tables report the
number of such events actually present in the sequence, the
number found by the AVI system, the Type I (false negative)
errors, and the Type II (false positive) errors.
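
For reference, the eight event indices being scored are the ones produced by the track-based indexing rules 1-8. A compact sketch of those rules, as read here, is given below; the function name, the string labels for state and events, and the flat argument list are assumptions for illustration, and the depositor/remover and motion/rest rules (9-12) are not covered.

```python
def index_marks(state, is_head, is_tail, indegree, outdegree):
    """Return the index marks rules 1-8 would place on one V-object.

    state is "moving", "stationary" or "unknown"; is_head / is_tail say
    whether the V-object is the first / last node of its track.
    """
    marks = []
    if is_head:
        if state == "moving" and indegree > 0:
            marks.append("appearance")        # rule 1
        if state == "stationary" and indegree == 0:
            marks.append("appearance")        # rule 2
        if state != "stationary" and indegree == 0:
            marks.append("entrance")          # rule 5
        if state == "stationary" and indegree == 1:
            marks.append("deposit")           # rule 7
    if is_tail:
        if state == "moving" and outdegree > 0:
            marks.append("disappearance")     # rule 3
        if state == "stationary" and outdegree == 0:
            marks.append("disappearance")     # rule 4
        if state != "stationary" and outdegree == 0:
            marks.append("exit")              # rule 6
        if state == "stationary" and outdegree == 1:
            marks.append("removal")           # rule 8
    return marks
```

A stationary track head with one incoming link, for example, is marked as a deposit, while a moving head with no incoming link is marked as an entrance.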
Of the 44 total events in the test sequences, the AVI
system displays 10 Type II errors but only one Type I error.
Thus, the system is conservative and tends to find at least the
desired events.

The system performed the worst on test Sequence 3, 
where it displayed the only Type I error and 8 of the 10 total 
Type II errors. This is primarily due to three reasons: 

1. Noise in the sequence, including vertical jitter from a 
poor frame-sync signal, resulted in very unstable 
motion segmentation masks. Thus, stationary objects 
appear to move significantly. 

2. The method used to track objects through occlusions 
presently assumes constant object trajectories. A 
motion tracking scheme that is more robust in the 
presence of rapidly changing trajectories will result in 
fewer false positives for many of the events. See S. 
Intille and A. Bobick, Closed-World Tracking, in Pro- 
ceedings of the Fifth International Conference on Com- 
puter Vision, 672-678 (1995). 

3. No means to track objects through occlusion by fixed 
scene objects is presently used. The light pole in the 
foreground of the scene temporarily occludes pedestri- 
ans who walk behind it, causing pairs of false entrance/ 
exit events. 
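
The constant-trajectory assumption named in item 2 amounts to extrapolating an object's last known centroid along its last forward velocity and asking whether later V-objects lie on that line, which is the requirement Equation 4 places on tracks. A minimal sketch follows; the function names and the explicit distance tolerance `tol` are assumptions, since the text states the relation without giving a threshold.

```python
import math


def predict_centroid(centroid, velocity, dt):
    """Linear prediction: centroid + velocity * elapsed time."""
    return (centroid[0] + velocity[0] * dt, centroid[1] + velocity[1] * dt)


def on_trajectory(last_centroid, velocity, last_time, candidates, tol):
    """Check that every candidate lies near the extrapolated path.

    `candidates` is an iterable of (centroid, time) pairs for the V-objects
    on the connecting path; `tol` is a hypothetical distance tolerance.
    """
    for centroid, t in candidates:
        predicted = predict_centroid(last_centroid, velocity, t - last_time)
        if math.dist(predicted, centroid) > tol:
            return False
    return True
```

An object that turns sharply while occluded fails this check, which is consistent with the false positives attributed above to rapidly changing trajectories.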

However, the system performed very well on Test 
Sequences 1 and 2 despite multiple simultaneous occlusions 
and moving shadows. And in all the sequences, the system 
is sufficiently robust to accurately respond to a large number 
of object-specific queries. 

TABLE 1

Event detection results for Test Sequence 1

                 Actual   Detected   Type I   Type II
Appearance          0         0         0        0
Disappearance       2         2         0        0
Entrance            2         2         0        0
Exit                2         3         0        1
Deposit             3         3         0        0
Removal             1         1         0        0
Motion              0         0         0        0
Rest                0         0         0        0
Total              10        11         0        1



TABLE 2

Event detection results for Test Sequence 2

                 Actual   Detected   Type I   Type II
Appearance          3         3         0        0
Disappearance       2         2         0        0
Entrance            1         1         0        0
Exit                1         2         0        1
Deposit             2         2         0        0
Removal             3         3         0        0
Motion              1         1         0        0
Rest                1         1         0        0
Total              14        15         0        1
TABLE 3

Event detection results for Test Sequence 3

                 Actual   Detected   Type I   Type II
Appearance          2         3         0        1
Disappearance       0         2         0        2
Entrance            7         8         0        1
Exit                8         9         0        1
Deposit             0         0         0        0
Removal             0         0         0        0
Motion              0         1         0        1
Rest                3         4         1        2
Total              20        27         1        8
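
The three tables can be cross-checked mechanically. The sketch below transcribes the rows of Tables 1-3 and verifies two things: that each column sums to the reported total, and that in every row the detected count equals the actual count minus the Type I errors plus the Type II errors (using the tables' own convention that Type I means a missed event and Type II a false detection). The dictionary layout and function names are illustrative choices.

```python
# Each row is (actual, detected, type_i, type_ii), copied from Tables 1-3.
TABLES = {
    1: {"Appearance": (0, 0, 0, 0), "Disappearance": (2, 2, 0, 0),
        "Entrance": (2, 2, 0, 0), "Exit": (2, 3, 0, 1),
        "Deposit": (3, 3, 0, 0), "Removal": (1, 1, 0, 0),
        "Motion": (0, 0, 0, 0), "Rest": (0, 0, 0, 0)},
    2: {"Appearance": (3, 3, 0, 0), "Disappearance": (2, 2, 0, 0),
        "Entrance": (1, 1, 0, 0), "Exit": (1, 2, 0, 1),
        "Deposit": (2, 2, 0, 0), "Removal": (3, 3, 0, 0),
        "Motion": (1, 1, 0, 0), "Rest": (1, 1, 0, 0)},
    3: {"Appearance": (2, 3, 0, 1), "Disappearance": (0, 2, 0, 2),
        "Entrance": (7, 8, 0, 1), "Exit": (8, 9, 0, 1),
        "Deposit": (0, 0, 0, 0), "Removal": (0, 0, 0, 0),
        "Motion": (0, 1, 0, 1), "Rest": (3, 4, 1, 2)},
}


def totals(rows):
    """Column sums: (actual, detected, type_i, type_ii)."""
    return tuple(sum(col) for col in zip(*rows.values()))


def consistent(rows):
    """detected == actual - misses + false detections, for every event."""
    return all(d == a - t1 + t2 for a, d, t1, t2 in rows.values())


def overall():
    """Totals across all three test sequences."""
    return tuple(sum(col) for col in zip(*(totals(r) for r in TABLES.values())))
```

Summed over the three sequences this gives 44 actual events, one Type I error, and 10 Type II errors, matching the figures quoted in the discussion of the results.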



The video indexing system described here may also be
implemented as a real-time system, as, for example, in an
advanced video motion detector. FIG. 27 shows a diagram of
such an implementation. Here, the vision subsystem 100
processes the output of the camera 101 frame-by-frame, and
continuously updates a motion graph annotated with event
index marks. An event scanner 103 continuously reads the
motion graph updates and searches for motion events as
specified by pre-set watchpoints. These watchpoints may
take the same form as queries from the AVI user interface,
i.e. Y=(C,T,V,R,E). When the criteria for one of the
watchpoints are met, the event scanner signals an actuator 105 (such
as an alarm).
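
The event scanner loop can be sketched as a filter over a stream of index marks. Everything named below is a hypothetical stand-in: `IndexMark` and `Watchpoint` and their fields are invented for the example, a rectangle stands in for the spatial region R, and the actuator is modeled as any callable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexMark:
    event: str                       # e.g. "removal"
    frame: int
    object_id: str
    centroid: tuple[float, float]


@dataclass(frozen=True)
class Watchpoint:
    """Optional T, V, R, E constraints; None means 'do not care'."""
    event: Optional[str] = None
    frames: Optional[tuple[int, int]] = None
    object_id: Optional[str] = None
    region: Optional[tuple[float, float, float, float]] = None  # x0, y0, x1, y1

    def matches(self, m: IndexMark) -> bool:
        if self.event is not None and m.event != self.event:
            return False
        if self.frames is not None and not (self.frames[0] <= m.frame <= self.frames[1]):
            return False
        if self.object_id is not None and m.object_id != self.object_id:
            return False
        if self.region is not None:
            x0, y0, x1, y1 = self.region
            if not (x0 <= m.centroid[0] <= x1 and y0 <= m.centroid[1] <= y1):
                return False
        return True


def scan(updates, watchpoints, actuator):
    """Signal the actuator once for each mark that meets any watchpoint."""
    fired = 0
    for mark in updates:
        if any(w.matches(mark) for w in watchpoints):
            actuator(mark)
            fired += 1
    return fired
```

Because `updates` is consumed as an iterable, the same function works on a generator that yields marks as the vision subsystem produces them.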

OTHER EMBODIMENTS 

Although the present invention and its advantages have 
been described in detail, it should be understood that various 
changes, substitutions and alterations can be made herein 
without departing from the spirit and scope of the invention 
as defined by the appended claims. 
What is claimed is: 

1. A method of providing video indexing comprising the 
steps of: 

(a) detecting objects in video to provide detected objects
comprising the step of performing motion segmentation
including image differencing, thresholding, morphology
and connected component analysis;

(b) analyzing motion of said detected objects comprising 
the step of tracking said detected objects using linear 
prediction of object position; 

(c) generating a symbolic motion description of object
motion; and

(d) placing index marks in said symbolic motion descrip- 
tion to identify occurrence of events in the video. 

2. The method of claim 1 wherein said step of detecting
objects includes performing motion segmentation comprising
the steps of image differencing, thresholding, connected
component analysis, image smoothing and morphology.

3. The method of claim 1 wherein said analyzing step
includes associating objects in successive frames of said
video using a mutual-nearest-neighbor criterion.

4. The method of claim 1 wherein said analyzing step 
includes determining paths of said objects and intersection 
with paths of other objects. 

5. The method of claim 1 wherein said generating step
includes generating a directed graph reflective of the paths
and path intersections of said objects.

6. The method of claim 1 wherein said generating step 
includes generating a record of image statistics for objects in 
every video frame. 

7. The method of claim 6 wherein said generating a record
of image statistics step includes generating size, shape,
position, and time-stamp of objects in every video frame.






5,969,755 






8. The method of claim 1 wherein said generating step 
includes generating hierarchical graph node groupings 
reflective of paths and intersections of said objects. 

9. The method of claim 8 wherein said placing index 
marks step includes placement of index marks correspond- 
ing to motion events in accordance with said hierarchical 
graph node groupings. 

10. The method of claim 9 wherein said placing index 
marks step includes use of a rule-based classifier. 

11. The method of claim 9 wherein said placing index 
marks step corresponding to motion events includes place- 
ment of one or more marks corresponding to appearance, 
disappearance, deposit, removal, entrance, exit, motion, or 
rest of objects. 

12. A method of providing video indexing comprising the 
steps of: 

(a) detecting objects in a video to provide detected 
objects; 

(b) analyzing motion of said detected objects; 

(c) generating a symbolic motion description of object 
motion; and 

(d) placing index marks in said symbolic motion descrip- 
tion to identify occurrence of events in the video, said 
generating step includes generating primary and sec- 
ondary graph links reflective of the likelihood of accu- 
rate motion analysis. 

13. A method of providing video indexing comprising the 
steps of: 

(a) detecting objects in a video to provide detected 
objects; 

(b) analyzing motion of said detected objects; 

(c) generating a symbolic motion description of object 
motion; and 

(d) placing index marks in said symbolic motion descrip- 
tion to identify occurrence of events in the video, said 
generating step includes generating hierarchical graph
node groupings reflective of paths and intersections of
said objects, said hierarchical graph node groupings
reflect the likelihood of accurate motion analysis. 

14. A method for real-time detection of video events 
comprising the steps of: 

(a) detecting object in real-time video to provide detected 
objects; said step of detecting including performing 
motion segmentation comprising the steps of image 
differencing, thresholding, connected components 
analysis and morphology; 

(b) analyzing motion of said detected objects including 
the step of tracking said detected objects using linear 
prediction of object positions; 

(c) generating a symbolic motion description of object 
motion; 

(d) placing index marks in said symbolic motion descrip- 
tion to identify occurrence of events in video; and 

(e) providing a signal in response to the occurrence of said 
video events. 
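
The motion segmentation steps recited in claim 14 (image differencing, thresholding, morphology, and connected components analysis) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the default threshold of 30, the 3x3 structuring element, the use of a morphological close, 8-connectivity, and the order in which the steps are applied are not taken from the claims. Frames are plain lists of grayscale rows.

```python
from collections import deque


def _morph(mask, combine):
    """Apply a 3x3 neighborhood operator (any = dilate, all = erode)."""
    rows, cols = len(mask), len(mask[0])
    return [[combine(mask[rr][cc]
                     for rr in range(max(r - 1, 0), min(r + 2, rows))
                     for cc in range(max(c - 1, 0), min(c + 2, cols)))
             for c in range(cols)] for r in range(rows)]


def dilate(mask):
    return _morph(mask, any)


def erode(mask):
    return _morph(mask, all)


def connected_components(mask):
    """Group foreground pixels into 8-connected regions (sets of (row, col))."""
    rows, cols = len(mask), len(mask[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            region, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny in range(max(y - 1, 0), min(y + 2, rows)):
                    for nx in range(max(x - 1, 0), min(x + 2, cols)):
                        if mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            regions.append(region)
    return regions


def segment_motion(frame, reference, threshold=30, min_area=1):
    """Return one pixel set per region that differs from the reference image."""
    rows, cols = len(frame), len(frame[0])
    # Image differencing and thresholding (threshold value is an assumption).
    mask = [[abs(frame[r][c] - reference[r][c]) > threshold
             for c in range(cols)] for r in range(rows)]
    # Morphology: a close (dilate, then erode) to fill small gaps.
    mask = erode(dilate(mask))
    # Connected components analysis, discarding regions below min_area.
    return [reg for reg in connected_components(mask) if len(reg) >= min_area]
```

Each returned region is a candidate object whose centroid or bounding box could then be passed to the tracking step.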



15. The method of claim 14 wherein said step of providing 
a signal provides said signal for the purpose of sounding an alarm. 

16. The method of claim 14 wherein said step of providing 
a signal provides said signal to make a record. 

17. The method of claim 15 wherein said step of providing 
a signal comprises initiation of an automated action by 
signaling one of a computer program or electronic device. 

18. A method to assist human analysis of video data 
comprising the steps of: 

(a) detecting objects in a video to provide detected 
objects; said detecting step including performing 
motion segmentation comprising the steps of image 
differencing, thresholding, connected component 
analysis and morphology; 

(b) analyzing motion of said objects; said analyzing step 
including the step of tracking said detected objects 
using linear prediction of said object positions; 

(c) generating a symbolic motion description of object 
motion; 

(d) placing index marks in said symbolic motion descrip- 
tion to identify occurrence of events in video; 

(e) receiving content-based queries; 

(f) matching queries with symbolic video information and 
said index marks; and 

(g) providing video sequences corresponding to the query. 
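
Tracking detected objects using linear prediction of object positions, as recited in step (b), can be sketched as follows. This is an illustrative assumption rather than the claimed procedure: positions are extrapolated at constant velocity from the last two observations, and each track is greedily matched to the nearest unclaimed detection. The names `predict` and `match_tracks` and the `max_dist` gate are hypothetical.

```python
import math


def predict(prev, curr):
    """Linearly extrapolate the next (x, y) position from the last two."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


def match_tracks(tracks, detections, max_dist=20.0):
    """Map each track id to the index of its matched detection.

    tracks: dict of track id -> list of past (x, y) positions (two or more).
    detections: list of (x, y) centroids found in the current frame.
    Greedy nearest-neighbor matching is a simplifying assumption.
    """
    assignments = {}
    unused = set(range(len(detections)))
    for track_id, history in tracks.items():
        predicted = predict(history[-2], history[-1])
        best = min(unused,
                   key=lambda i: math.dist(predicted, detections[i]),
                   default=None)
        if best is not None and math.dist(predicted, detections[best]) <= max_dist:
            assignments[track_id] = best
            unused.discard(best)
    return assignments
```

Tracks left without a match, and detections left unclaimed, are the cases that would correspond to objects disappearing from or appearing in the scene.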

19. The method of claim 18 wherein said step of receiving 
content-based queries includes receiving queries with 
constraints involving one or more of a video clip, time interval, 
object, spatial region, or motion event. 

20. The method of claim 19 wherein constraints for said 
queries may be specified by manipulation of video clips. 

21. The method of claim 19 wherein said step of matching 
queries with symbolic video information includes filtering 
symbolic information to meet the query constraints of one or 
more of a video clip, time interval, object, spatial region, or 
motion event. 
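
The filtering of symbolic information by query constraints described in claims 19 and 21 can be sketched as below. The `IndexMark` record, its field names, and the `filter_marks` function are hypothetical and chosen only for illustration; each constraint is optional, and a mark is kept only if it satisfies every constraint that was supplied.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IndexMark:
    clip: str                  # name of the video clip
    frame: int                 # frame number within the clip
    object_id: int             # tracked object the mark refers to
    position: Tuple[int, int]  # (x, y) of the object in the frame
    event: str                 # e.g. "deposit", "removal", "entrance"


def filter_marks(marks, clip=None, interval=None, object_id=None,
                 region=None, event=None):
    """Return the marks that satisfy every constraint that is not None.

    interval: (first_frame, last_frame), inclusive.
    region: (x0, y0, x1, y1) bounding box, inclusive.
    """
    results = []
    for m in marks:
        if clip is not None and m.clip != clip:
            continue
        if interval is not None and not (interval[0] <= m.frame <= interval[1]):
            continue
        if object_id is not None and m.object_id != object_id:
            continue
        if region is not None:
            x0, y0, x1, y1 = region
            if not (x0 <= m.position[0] <= x1 and y0 <= m.position[1] <= y1):
                continue
        if event is not None and m.event != event:
            continue
        results.append(m)
    return results
```

For example, `filter_marks(marks, event="deposit", region=(0, 0, 100, 100))` would return only the deposit events located inside that box.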

22. A method to assist human analysis of video data 
comprising the steps of: 

(a) detecting objects in a video to provide detected 
objects; 

(b) analyzing motion of said objects; 

(c) generating a symbolic motion description of object 
motion; 

(d) placing index marks in said symbolic motion descrip- 
tion to identify occurrence of events in video; 

(e) receiving content-based queries; 

(f) matching queries with symbolic video information and 
said index marks; and 

(g) providing video sequences corresponding to the query, 
said step of providing video sequences includes a 
system clipboard with sets of video clips for progressive 
refinement of content-based queries and query 
results. 